14 research outputs found

    Cycle-consistent Adversarial Networks for Non-parallel Vocal Effort Based Speaking Style Conversion

    Speaking style conversion (SSC) is the technology of converting natural speech signals from one style to another. In this study, we propose the use of cycle-consistent adversarial networks (CycleGANs) for converting styles with varying vocal effort, and focus on conversion between normal and Lombard styles as a case study of this problem. We propose a parametric approach that uses the Pulse Model in Log domain (PML) vocoder to extract speech features. The CycleGAN maps these features from utterances in the source style to the corresponding features of the target style. Finally, the mapped features are converted to a Lombard speech waveform with the PML vocoder. The CycleGAN was compared in subjective listening tests with two other standard mapping methods used in conversion, and it was found to perform best in terms of both speech quality and the magnitude of the perceptual change between the two styles. Peer reviewed.
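
The cycle-consistency idea that lets a CycleGAN train on non-parallel data can be illustrated with a small sketch. The code below is not taken from the paper: it assumes the commonly used L1 form of the cycle-consistency term, and the generators `G` (normal to Lombard) and `F` (Lombard to normal) are hypothetical toy functions standing in for trained networks, applied to random arrays standing in for vocoder features.

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F):
    """L1 cycle loss: mean|F(G(x)) - x| + mean|G(F(y)) - y|.

    x and y are feature matrices from the two styles. They do not need
    to be aligned or even have the same number of frames.
    """
    forward = np.mean(np.abs(F(G(x)) - x))   # x -> G -> F should return x
    backward = np.mean(np.abs(G(F(y)) - y))  # y -> F -> G should return y
    return forward + backward

# Made-up, non-parallel "features": 5 normal frames, 7 Lombard frames.
rng = np.random.default_rng(0)
normal_feats = rng.normal(size=(5, 4))
lombard_feats = rng.normal(size=(7, 4)) + 1.0

# Hypothetical toy generators (assumptions, not trained models).
G = lambda x: x + 1.0      # normal -> Lombard
F = lambda y: y - 1.0      # Lombard -> normal, exact inverse of G
F_bad = lambda y: y - 0.5  # imperfect inverse of G

loss_perfect = cycle_consistency_loss(normal_feats, lombard_feats, G, F)
loss_bad = cycle_consistency_loss(normal_feats, lombard_feats, G, F_bad)
```

With an exact inverse pair the loss is close to zero, while the imperfect inverse is penalised. In a full CycleGAN this term would be combined with adversarial losses for both mapping directions.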

    Analysis of similarity-based representations for heterogeneous data (L'anàlisi de les representacions basades en similitud dades heterogènies)

    The most common methodology of similarity-based learning is the k-nearest neighbour method. However, it has the disadvantage of having to store all the instances in the dataset and calculate distances (similarities) to each of them. A more efficient methodology is to calculate similarities to only a set of chosen prototypes and use these as a new representation for learning. This thesis deals with the implementation of these ideas using Gower's similarity measure for heterogeneous data. Clustering and feature selection methods are compared for prototype selection. Different methodologies are implemented in an attempt to improve Gower's similarity measure by incorporating feature weights. Novel methodologies for extracting deep (higher-level) features from the similarity representation are proposed. The thesis provides encouraging preliminary results in these areas of research.
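
The prototype-based representation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: it uses the standard definition of Gower's similarity (range-normalised absolute difference for numeric features, exact match for categorical ones, averaged with optional weights), and the records, feature ranges, and choice of prototypes are made-up examples. The helper names are hypothetical.

```python
import numpy as np

def gower_similarity(a, b, is_numeric, ranges, weights=None):
    """Gower similarity between two mixed-type records.

    Numeric feature k scores 1 - |a_k - b_k| / range_k (range_k assumed
    non-zero); a categorical feature scores 1 on a match and 0 otherwise.
    The result is the (optionally weighted) mean of the feature scores.
    """
    n = len(a)
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    scores = np.empty(n)
    for k in range(n):
        if is_numeric[k]:
            scores[k] = 1.0 - abs(a[k] - b[k]) / ranges[k]
        else:
            scores[k] = 1.0 if a[k] == b[k] else 0.0
    return float(np.sum(w * scores) / np.sum(w))

def prototype_representation(records, prototypes, is_numeric, ranges, weights=None):
    """Represent each record by its similarities to the prototypes only."""
    return np.array([
        [gower_similarity(r, p, is_numeric, ranges, weights) for p in prototypes]
        for r in records
    ])

# Made-up heterogeneous data: (age, income, colour).
records = [(30, 50000.0, "red"), (40, 60000.0, "blue"), (50, 70000.0, "red")]
is_numeric = [True, True, False]
ranges = [20.0, 20000.0, None]          # ranges are unused for categorical features
prototypes = [records[0], records[2]]   # hypothetical prototype choice

features = prototype_representation(records, prototypes, is_numeric, ranges)
```

Each record is now a vector with one entry per prototype, so a downstream learner works with `len(prototypes)` features instead of comparing against every stored instance as k-nearest neighbour does.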

    Machine learning methods for suprasegmental analysis and conversion in speech

    Speech technology is a field of technological research focusing on methods to process spoken language. Work in the area has largely relied on a combination of domain-specific knowledge and digital signal processing (DSP) algorithms, often combined with statistical (parametric) models. In this context, machine learning (ML) has played a central role in estimating the parameters of such models. Recently, better access to large quantities of data has opened the door to advanced ML models that are less constrained by the assumptions necessary for the DSP models and are potentially capable of achieving higher performance. The goal of this thesis is to investigate the applicability of recent state-of-the-art (SoA) developments in ML to the modelling and processing of speech at the so-called suprasegmental level to tackle the following topical problems in speech research: 1) zero-resource speech processing (ZS), which aims to learn language patterns from speech without access to annotated datasets, 2) automatic word (WCE) and syllable (SCE) count estimation, which focus on quantifying the amount of linguistic content in audio recordings, and 3) speaking style conversion (SSC), which deals with the conversion of the speaking style of an utterance while retaining the linguistic content, speaker identity and quality. In contrast to the segmental level, which consists of elementary speech units known as phone(me)s, the suprasegmental level encodes more slowly varying characteristics of speech such as the speaker identity, speaking style, prosody and emotion. The ML approaches used in the thesis are non-parametric Bayesian (NPB) models, which have a strong mathematical foundation based on Bayesian statistics, and artificial neural networks (NNs), which are universal function approximators capable of leveraging large quantities of training data. The NN variants used include 1) end-to-end models that are capable of learning complicated mapping functions without the need to explicitly model the intermediate steps, and 2) generative adversarial networks (GANs), which are trained as a minimax game between two competing NNs. In ZS, NPB clustering methods were investigated for the discovery of syllabic clusters from speech and were shown to eliminate the need for model selection. In the WCE/SCE task, a novel end-to-end model was developed for automatic and language-independent syllable counting from speech. The method improved the syllable counting accuracy by approximately 10 percentage points over the previously published SoA method while relaxing the requirements on the data annotation used for model training. As for SSC, a new parametric approach was introduced for the task. Bayesian models were first studied with parallel data, followed by GAN-based solutions for non-parallel data. GAN-based models were shown to achieve SoA performance in terms of both subjective and objective measures without access to parallel data. Augmented CycleGANs also enable manual control of the degree of style conversion achieved in the SSC task.

    SylNet

    Automatic syllable count estimation (SCE) is used in a variety of applications ranging from speaking rate estimation to detecting social activity from wearable microphones or developmental research concerned with quantifying speech heard by language-learning children in different environments. The majority of previously utilized SCE methods have relied on heuristic digital signal processing (DSP) methods, and only a small number of approaches, based on bi-directional long short-term memory (BLSTM) networks, have made use of modern machine learning in the SCE task. This letter presents a novel end-to-end method called SylNet for automatic syllable counting from speech, built on the basis of recent developments in neural network architectures. We describe how the entire model can be optimized directly to minimize SCE error on the training data without annotations aligned at the syllable level, and how it can be adapted to new languages using limited speech data with known syllable counts. Experiments on several different languages reveal that SylNet generalizes to languages beyond its training data and further improves with adaptation. It also outperforms several previously proposed methods for syllabification, including end-to-end BLSTMs. Peer reviewed.

    Comparison of syllabification algorithms and training strategies for robust word count estimation across different languages and recording conditions

    Word count estimation (WCE) from audio recordings has a number of applications, including quantifying the amount of speech that language-learning infants hear in their natural environments, as captured by daylong recordings made with devices worn by infants. To be applicable in a wide range of scenarios, including low-resource domains, WCE tools should be extremely robust against varying signal conditions and require minimal access to labeled training data in the target domain. For this purpose, earlier work has used automatic syllabification of speech, followed by a least-squares mapping of syllable counts to word counts. This paper compares a number of previously proposed syllabifiers in the WCE task, including a supervised bi-directional long short-term memory (BLSTM) network that is trained on a language for which high-quality syllable annotations are available (a “high resource language”), and reports how the alternative methods compare on different languages and signal conditions. We also explore additive-noise and varying-channel data augmentation strategies for BLSTM training, and show how they improve performance in both matching and mismatching languages. Intriguingly, we also find that even though the BLSTM works on languages beyond its training data, the unsupervised algorithms can still outperform it in challenging signal conditions on novel languages. Peer reviewed.
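
The least-squares mapping from syllable counts to word counts mentioned in the abstract can be sketched as below. This is an illustrative assumption rather than the paper's code: it fits a single linear model with a slope and an intercept (the abstract does not say whether an intercept is used), the counts are made-up numbers, and the function names are hypothetical.

```python
import numpy as np

def fit_syllable_to_word_mapping(syllable_counts, word_counts):
    """Fit word_count ~ slope * syllable_count + intercept by least squares."""
    s = np.asarray(syllable_counts, dtype=float)
    X = np.column_stack([s, np.ones(len(s))])  # design matrix with bias column
    coef, *_ = np.linalg.lstsq(X, np.asarray(word_counts, dtype=float), rcond=None)
    return coef  # array([slope, intercept])

def estimate_word_counts(syllable_counts, coef):
    """Apply the fitted linear mapping to new syllable counts."""
    return coef[0] * np.asarray(syllable_counts, dtype=float) + coef[1]

# Made-up calibration data: syllable counts from a syllabifier on a few
# annotated recordings, paired with their known word counts.
syl = np.array([10.0, 20.0, 30.0, 40.0])
words = 0.7 * syl + 1.0  # synthetic, exactly linear for illustration

coef = fit_syllable_to_word_mapping(syl, words)
pred = estimate_word_counts([50], coef)
```

Only a small amount of annotated data in the target domain is needed to fit the two coefficients, which is one reason a mapping of this kind suits low-resource settings.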

    Augmented CycleGANs for continuous scale normal-to-Lombard speaking style conversion

    Lombard speech is a speaking style associated with increased vocal effort that is naturally used by humans to improve intelligibility in the presence of noise. It is hence desirable to have a system capable of converting speech from normal to Lombard style. Moreover, it would be useful if one could adjust the degree of Lombardness in the converted speech so that the system is more adaptable to different noise environments. In this study, we propose the use of recently developed Augmented cycle-consistent adversarial networks (Augmented CycleGANs) for conversion between normal and Lombard speaking styles. The proposed system provides smooth control over the degree of Lombardness of the mapped utterances by traversing different points in the latent space of the trained model. We utilize a parametric approach that uses the Pulse Model in Log domain (PML) vocoder to extract features from normal speech that are then mapped to Lombard-style features using the Augmented CycleGAN. Finally, the mapped features are converted to Lombard speech with PML. The model is trained on multi-language data recorded in different noise conditions, and we compare its effectiveness to a previously proposed CycleGAN system in experiments on the intelligibility and quality of the mapped speech. Peer reviewed.

    Vocal Effort Based Speaking Style Conversion Using Vocoder Features and Parallel Learning

    Speaking style conversion (SSC) is the technology of converting natural speech signals from one style to another. In this study, we aim to provide a general SSC system for converting styles with varying vocal effort and focus on normal-to-Lombard conversion as a case study of this problem. We propose a parametric approach that uses a vocoder to extract speech features. These features are mapped using parallel machine learning models from utterances spoken in normal style to the corresponding features of Lombard speech. Finally, the mapped features are converted to a Lombard speech waveform with the vocoder. A total of three vocoders (GlottDNN, STRAIGHT, and Pulse Model in Log domain (PML)) and three machine learning mapping methods (standard GMM, Bayesian GMM, and feed-forward DNN) were compared in the proposed normal-to-Lombard style conversion system. The conversion was evaluated using two subjective listening tests measuring perceived Lombardness and quality of the converted speech signals, and by using an instrumental measure called Speech Intelligibility in Bits (SIIB) for speech intelligibility evaluation under various noise levels. The results of the subjective tests show that the system is able to convert normal speech into Lombard speech and that there is a trade-off between quality and Lombardness of the mapped utterances. GlottDNN and PML stand out as the best vocoders in terms of quality and Lombardness, respectively, whereas the DNN is the best mapping method in terms of Lombardness. PML with the standard GMM seems to give a good compromise between the two attributes. The SIIB experiments indicate that the intelligibility of converted speech, compared to that of normal speech, improved in noisy conditions most effectively when DNN mapping was used with STRAIGHT and PML. Peer reviewed.